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Abstract 

We present a rapid method for the exact calculation of the cumulative distribution 
function of the maximum of multinomially distributed random variables. The method 
runs in time 0(mn), where m is the desired maximum and n is the number of variables. 
We apply the method to the analysis of two situations where an apparent clustering of 
cases of a disease in some locality has raised the possibility that the disease might be 
communicable, and this possibility has been discussed in the recent literature. We con- 
clude that one of these clusters may be explained on purely random grounds, whereas 
the other may not. 



1 Introduction 

It happens, from time to time, that cases of a disease will cluster both geographically and in 
time, in a manner which seems not to be random, and which invites further epidemiological 
study regarding communicability of the disease. 
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Of course mathematics alone cannot answer serious questions of public health, but it can 
provide guidelines about what sort of clustering should be regarded as unusual, and what 
sort is to be expected. In particular, the calculation of a P-value is required for an objective 
assessment of any observed event. In this paper we provide a rapid and exact P-value 
calculation for the standard "balls-in-boxes" model appropriate to the disease clustering 
situation. 

2 The model 

Suppose that during a certain time period, a number r of cases of some disease arise randomly 
in some large population, such as that of the the U.S. Let N be size of that population and 
A^o be the population of the community in which the seemingly large number of cases has 
occurred. 

We think of the entire country as consisting of n = N/Nq identical communities, or cells, 
each containing A^o people, and we ask: 

If r cases occur randomly in the populations of n communities of the same size, 
what is the probability that no community gets more than m cases of the disease? 

The standard calculation required to answer this question involves the "balls-in boxes" 
model, discussed below. If, for example, it turns out that it is extremely likely that some 
community of equivalent size to that where the seemingly large number of cases occurred, 
purely by chance, we could conclude that the observed cluster would not be a cause for 
further investigation or suspicion of communicability. Likewise, if it turns out that it is 
extremely unlikely that, by chance, any community of the size of that of interest would have 
the observed number of cases of the disease, then support would be given to the possibility 
of a public health hazard. 

3 The mathematics 

Mathematically speaking, we have r "balls" (the disease cases) being dropped randomly 
into n labeled "boxes" (the communities). The relevant calculation thus concerns the P- 
value associated with the box (or boxes) having the largest number of balls in it. It is 
well known that the distribution function of the maximum of a number of random variables 
changes sharply near the mean of the maximum, so that an exact rather than an approximate 
calculation is needed to find this P-value. We provide this exact calculation in this paper. 

The P-value associated with an observed value m of cases of the disease in the community 
of interest is the probability that the maximum number of balls in any box in m or more. 
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We find this probability by first finding the probabihty that no box contains more than m 
balls. Denote this probability by P{r,n,m). 

Now, the probability that there are ri balls in box 1, and r2 in box 2, and . . . , and r„ in 
box n, is given by the well known multinomial distribution, 

1 r' 

Pr(ri,r2, . . . ,rn) = — — j— j r. (r = ri + . . . + r„) (1) 

ri!r2! . . . r„! 

The probability that no box contains more than m balls (i.e., the cumulative distribution 
function of the maximum of the r,, evaluated at m) is 

1 r' 

P(r,n,m) = Pr(all are < m) = ^ — ^ ■ , " ^ , ■ (2) 

0<ri,r2 r„<m ' 1 • ' 2 • • • • ' ?1 • 

ri+r2 + ... + r„=r 

4 The computation 

At first sight the expression ^ seems appallingly complicated for exact computation, if r 
and n are large. Various approximations, such as the Poisson approximation, have been used 
by researchers in order to avoid the apparently tedious computation in 

However the exact calculation can be completely tamed by two steps. First we introduce 
the function 

e^(x) = l + x + - + - + ... + -, 

which is simply the mth section of the exponential series. Then P{r, n, m) is rl/ri'' times the 
coefficient of in the series em{x)^. The question of computing a particular coefficient of 
a high power of a given power series is a well studied problem in computer science, and the 
following solution, which makes the computation quite rapid and easy to program, is taken 
from [5] (chap. 21). 

Let /(x) = J2j O'jX'' be a given power series and let h{x) = /(x)". The question is, if 
h{x) = J2j hjx\ how can we economically compute the /i/s from the given a/s7 We begin 
by taking logarithms of the equation h = to get logh{x) = nlog/(x). Now differentiate 
both sides with respect to x to obtain h' /h = nf /f, and cross multiply to eliminate fractions, 
yielding fh' = nhf. Next insert the power series expansions of the various functions into 
this equation, and multiply both sides by x, for cosmetic reasons, to get 
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Finally, equate the coefficients of a given power of x, say x*, on both sides of the last equation, 
which gives, 

s s 

1=0 1=0 

This is a recurrence relation. We can use it to compute the unknown h/s successively, 
in the order ho, hi, h2, . . .. To make this explicit, we can rewrite the above in the form 

/i. = — + (s = 1,2,3,...) (3) 

In this form it is clear that each hg is determined from ho, hi, ... , hs_i. 

In the particular case at hand, of powers of the truncated exponential series ek{x), we 
have aj = for < j < m, and aj = for all other values of j. The recurrence takes the 
form 

-| min (s,m) 7 

hs = - E iin + l)i-s)^. (. = 1,2,3,...) (4) 

We summarize the calculation procedure as follows. To compute P(r, n, m) as defined by 
eq. (j2I) above, 

• Take h^ = 1 and successively compute hi, h2, . . . , hr from (jH). 

• Then P(r,n,m) =r\hr/n^. 

A remarkable feature of this algorithm is that the computation of each hg requires the 
knowledge of only m earlier values, so the entire computation can be done with just m units 
of array storage. For example, it can find the probability that the maximum is < 8, for 
15000 balls in 10000 boxes using only 8 array storage locations. In summary, it runs in time 
which is 0{mn) and uses only 0{m) storage. 

We remark that as we have presented it this method works only for the situation in which 
the cells have equal probabilities. It can be extended, at only a small extra cost, to the case 
of unequal probabilities, which may be useful for power calculations. 

5 Some related work 

The problem of finding the distribution of the maximum occupancy in a balls-and-cells 
problem is very old. Already in Barton and David [1 one finds the first observation above, 
namely that the desired probability is a certain coefficient in a power of a given power 
series. In this observation of Barton and David is cited, and is said to be "not in a form 
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convenient for computing," which is true absent our second step above, in eq. of vastly 
accelerating the computation of the high power of the given series. 

Freeman's algorithm in [2] sought to economize the computation by grouping together 
vectors of occupancy numbers which, as unordered multisets, were the same. Hence he 
listed partitions with given largest size part, and counted the occupancies of that subset of 
all partitions. This is a large amount more labor than our method above, which requires 
computing time roughly proportional to the square of the number of "balls," whereas earlier 
methods required exponential time. 

Likewise the recurrence (j^) for computing powers of power series has a long history. 
Although we have followed j3] in our presentation, the recurrence method was certainly not 
invented by them, and is described in several earlier works. Nonetheless, the concatenation of 
the two methods in connnection with finding the distribution of the maximum cell occupancy 
seems to be new. 

6 Leukemia: two examples 

Example 1. We consider first the much discussed case (see and 0) of childhood leukemia 
in Niles, IL in the five year period 1956-1960. Heath gives a total of eight cases in this 
town during this period, as compared to an expected number of 1.6. In 1960 the population 
of Niles was about 20,000 people. The total population of the U.S. in 1960 was approximately 
180,000,000 people. Therefore the U.S. population in 1960 can be thought of as consisting of 
9,000 cells, the population of each being 20,000 people. An expected number of 1.6 in Niles 
would then correspond to a total of about 14,400 cases in the U.S. in the five year period 
studied. 

Using the formula above, we therefore computed the exact probability that if 14,400 balls 
are distributed randomly into 9000 cells, then no cell will get more than m balls, for each 
m = 6, . . . , 12, and in particular for m = 8. We also computed the P- value for each of these 
values of m using the fact that the P-values corresponding to an observed maximum of m 
is given by 1 - P(14400, 9000, m - 1). 

For comparison, we ran a Monte Carlo computer experiment in which we repeated 1000 
times the operation of distributing 14400 balls randomly into 9000 cells, and recorded the 
frequencies of the maximum occupancy numbers, thus giving an empirical distribution func- 
tion for m. (1000 replications are needed to give an estimate of the P-value for m = 8 that is 
accurate to within ±0.01 with probability 0.95.) The results of the this simulation, and the 
exact P(14400, 9000, m) computations are shown below, together with the exact P-values. 
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m P(14400, 9000, m) Monte Carlo P-value 



6 
7 
8 
9 

10 
11 
12 



0.000005 
0.095395 
0.664954 
0.937864 
0.990843 
0.998788 
0.999852 



0.000 
0.096 
0.678 
0.944 
0.993 
0.998 
0.999 



1.000000 
0.999995 
0.904605 
0.335046 
0.062136 
0.009157 
0.001212 



The computation of the above table of exact values, on a PC running the computer algebra 
system Maple, ^ required less than five seconds. The Monte Carlo computation required 
about thirty minutes. In both the Monte Carlo simulation and the exact calculations we 
observe the expected rapid change of P-values as m increases, emphasizing the need for exact 
P-value calculations as discussed above. 

We conclude from the above that the probability that some cell of population 20,000 
would have gotten 8 or more cases in the five year period studied is about 90 percent. Thus 
the Niles data do not appear, so far as formal P-value calculations are concerned, to show a 
significant cluster of cases of childhood leukemia. 

Example 2. Twelve cases of acute lymphocytic leukemia were observed p] in Churchill 
County, NV, among persons who had been residents of the county at the time of diagnosis, 
in the three year period 1999-2001. Concern was expressed that this was due to the exposure 
to some agent associated with a nearby naval air station. At that time the county had a 
population of approximately 24,000. The entire U.S. had a population of approximately 
288,000,000, equivalent to 12,000 units, or cells, each of the size of Churchill County. The 
State Epidemiologist, Dr. Randall Todd, estimated that, based on its population, about one 
case would be expected in Churchill County every five years. If we use that estimate, the 
incidence in the U.S. as a whole would be 12000 cases per five years, or 8000 cases per three 
year period. 

In this case we need the distribution function of the maximum number of balls in any 
cell if 8000 balls are thrown at random into 12,000 cells. The results are as shown below. 



Program available on request 
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m P(8000, 12000, m) P-value 



4 
5 
6 
7 
8 



0.000472 
0.436361 
0.925122 
0.993604 
0.999528 



1.000000 
0.999528 
0.563639 
0.074878 
0.006396 



Clearly the observed incidence of twelve cases in Churchill county cannot reasonably be 
ascribed to chance, and further epidemiological investigation is warranted. 

7 Further comments on the P-value 

The P-value corresponding to any value of m in the balls-in-boxes case can in principle 
be calculated exactly using standard inclusion/exclusion formulae. In practice this seems 
extremely difficult, because the alternating signs can cause catastrophic loss of significant 
digits. A Poisson approximation is also possible but may be inaccurate, particularly around 
the tails of the distribution. Our exact method, described in eq. (jH) above, is fast and does 
not suffer from any of those problems. 

A further comment about P-values is more wide-ranging. Many diseases might come to 
our attention because of an apparent clustering in some location in some time period. Also, 
many different time periods might be potentially observed. An overall P-value calculation, 
taking these matters into consideration, would be desirable, but in practice would probably 
be impossibly difficult, since no precise value can be attached to "the number of diseases 
that might come to our attention" or, possibly, to the number of time periods that we might 
have considered. 

8 A disclaimer 

Mathematics cannot prove or disprove the communicability of a disease process. It can only 
help to define the word "unusual." The benchmark given above seems like an appropriate 
one to use when investigating an outbreak which is localized spatially, temporally, or both. 
By this benchmark, the clustering of leukemia cases in Niles, Illinois, between 1956 and 1960 
was not unusual. In fact some collection of that number of cases in some community the size 
of Niles, in a five year period of keeping records, was to be expected with high probability. 
On the other hand, the Churchill County data seem extremely significant. 
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